This study seeks to define the key features that describe a habitable exoplanet through Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Non-Negative Matrix Factorization (NMF). A concise version of the methodology is as follows:
Our results show that the features that matter most for confirmed exoplanets differ fundamentally from those for candidate and false positive exoplanets.
Have you ever wondered if there is life outside of our planet? Are there any Earth-like planets beyond our solar system that could support life? So far, the only life that we know of is here on our planet, Earth.
Plastics break down into microplastics, which are now found in riverbanks, glaciers, and even inside fish. Roughly 70% of the planet's surface is covered by water, yet there are few ocean cleanup initiatives. With the deadline for avoiding irreversible damage from global warming approaching, and with many people having given up hope for this planet, is it time to find another and start over?
Finding a planet suitable for human habitation, or at least one that could support life on its surface, involves a set of criteria, among them stellar distance, planetary size, and composition. A candidate planet should lie within the habitable zone, also called the Goldilocks zone. This means the planet should orbit close enough to its star to absorb sufficient heat (that is, to have access to stellar energy) and yet far enough to support liquid water on its surface.
Back in 2009, NASA's Kepler Space Telescope was launched to detect exoplanets that could be habitable for humans. Since then, thousands of planets outside our solar system have been discovered. The telescope uses the transit method, which detects planets whose orbits are seen edge-on from Earth whenever they cross the line of sight between their star and Earth. These crossings, or transits, cause a periodic dimming of the star's light, which is monitored by Kepler's photometer.
The primary mission ended in 2013 due to mechanical issues, but because the telescope was still functional, the mission was extended under the name K2.
Transit Method
Bright glare from stars can hinder the detection of exoplanets, so direct detection through telescopes was not reliable. Astronomers therefore developed a way to detect these objects of interest indirectly: by looking at the effects that exoplanets have as they orbit their star. As a planet passes in front of the star (relative to the telescope), it temporarily blocks some of the star's light, so the star appears dimmer for a particular period of time. From the amount of dimming, we can infer the size of the planet.
To demonstrate, imagine a flashlight pointed at a wall without any obstruction. Now place an object, say a ball, between the flashlight and the wall. You will notice that the light projected onto the wall changes depending on the ball's size, material, and distance from the flashlight. These observations then serve as data for inferring the ball's characteristics.
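The link between the amount of dimming and the planet's size can be sketched with the standard first-order approximation that the fractional drop in brightness equals the squared ratio of the planet's radius to the star's radius. The snippet below is only an illustration under that assumption: it ignores limb darkening and grazing transits, it assumes the depth is given in parts per million, and the function names and the radius constants are hypothetical choices rather than anything taken from the archive.

```python
import math

# Assumed nominal radii, used only to convert units in this sketch
R_SUN_KM = 695_700.0
R_EARTH_KM = 6_378.1


def radius_ratio_from_depth(depth_ppm):
    """Return R_planet / R_star for a transit depth in parts per million."""
    # depth ~ (R_planet / R_star)^2, so the ratio is the square root of the depth
    return math.sqrt(depth_ppm * 1e-6)


def planet_radius_earth(depth_ppm, stellar_radius_solar):
    """Rough planet radius in Earth radii, given the star's radius in solar radii."""
    ratio = radius_ratio_from_depth(depth_ppm)
    return ratio * stellar_radius_solar * R_SUN_KM / R_EARTH_KM


# A 1% dip (10,000 ppm) in front of a Sun-sized star
print(radius_ratio_from_depth(10_000))
print(planet_radius_earth(10_000, 1.0))
```

Because the approximation leaves out the geometry of the transit, the result should be read as an order-of-magnitude estimate and not as a replacement for the fitted planetary radius in the dataset.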
Until its decommissioning in 2018, Kepler was able to send volumes of data which are now available to the public through the NASA Exoplanet Archive of the NASA Exoplanet Science Institute.
We clustered exoplanets using factor analysis tools such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Non-Negative Matrix Factorization (NMF).
This study used data from the NASA Exoplanet Archive, an online astronomical catalog and data service for exoplanets and their host stars that is reviewed and evaluated by a team of astronomers. It collates and cross-correlates astronomical data and information on exoplanets and their host stars and provides services to work with these data.
Among the services that this archive provides is an Application Programming Interface (API) which can return data about exoplanets. This study has used this service to retrieve the Cumulative Kepler Objects of Interest Table.
The dataset uses the prefix KOI which stands for Kepler object of interest. The variables of interest are as follows:
The whole dataset contains 9,565 observations (rows) and 50 features.
The process of identifying similar planets began with data collection from https://exoplanetarchive.ipac.caltech.edu through Application Programming Interface (API) calls. The resulting dataset was then split by disposition into confirmed, candidate, and false positive exoplanets.
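As a rough illustration of this collect-then-split step, the sketch below parses CSV text with Python's csv module and groups the rows by disposition. The sample text is made up and merely stands in for an API response, and split_by_disposition is a hypothetical helper, not part of the study's code. A CSV parser is used here because it handles quoted fields that contain commas, which a plain split on ',' would break apart.

```python
import csv
import io
from collections import defaultdict

# Made-up sample standing in for the API response (values are illustrative only)
sample_text = (
    "kepoi_name,kepler_name,koi_disposition,koi_period\n"
    'K00001.01,"Example, b",CONFIRMED,2.47\n'
    "K00002.01,,CANDIDATE,19.9\n"
    "K00003.01,,FALSE POSITIVE,1.73\n"
)


def split_by_disposition(csv_text):
    """Group parsed CSV rows (as dicts) by their koi_disposition value."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["koi_disposition"]].append(row)
    return dict(groups)


groups = split_by_disposition(sample_text)
# Number of rows per disposition
print({name: len(rows) for name, rows in groups.items()})
```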
PCA was done using the whole dataset as well as one for each of the previously mentioned dispositions. Performing PCA on the different subclasses or dispositions provided the team with further insights on the relevant features describing each.
A threshold of 80% cumulative variance explained was used so that the number of retained components is chosen consistently across all subclasses or dispositions. Similar to PCA, SVD was done to check whether the number of singular values could be reduced at the same 80% cumulative variance explained.
In order to determine the latent factors critical for clustering the data, NMF was done. This step tests how effectively the resulting clusters identify "confirmed" exoplanets.
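One way to turn a cumulative-variance threshold into a number of components is sketched below. The helper name and the example ratios are hypothetical; in practice the input would be something like the explained_variance_ratio_ array of a fitted decomposition.

```python
import numpy as np


def n_components_for_threshold(explained_variance_ratio, threshold=0.80):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    if cumulative[-1] < threshold:
        raise ValueError("threshold is never reached with the given components")
    # argmax returns the index of the first True; add 1 to turn it into a count
    return int(np.argmax(cumulative >= threshold) + 1)


# Hypothetical variance ratios, for illustration only
ratios = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(n_components_for_threshold(ratios, threshold=0.80))
print(n_components_for_threshold(ratios, threshold=0.90))
```

The explicit check matters because np.argmax returns 0 when no entry satisfies the condition, which would otherwise be misread as "one component is enough".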
import pandas as pd
import numpy as np
import seaborn as sns
import requests
import json
import matplotlib.pyplot as plt
from scipy.spatial.distance import euclidean
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
The dataset originally contains 50 features, which we reduced to 13 for interpretability. Most of the dropped features are error measurements such as koi_impact_err1 and koi_prad_err2, or identifiers that are irrelevant to the analysis, such as kepler_name and kepid. Because the error measurements might obscure the interpretation of our analysis, we decided to drop them.
def _to_float(elem):
    """Convert a raw value to float; return NaN when it cannot be parsed."""
    try:
        return float(elem)
    except (TypeError, ValueError):
        return np.nan


def _map_disp(elem, _map={"FALSE POSITIVE": 0, "CANDIDATE": 1, "CONFIRMED": 2}):
    """Encode a disposition label as a number; unknown labels become NaN."""
    # Alternative binary mapping (confirmed vs. everything else):
    # _map = {"FALSE POSITIVE": 0, "CANDIDATE": 0, "CONFIRMED": 1}
    return _map.get(elem, np.nan)
def _get_data(split_disposition=False):
    """
    Download the cumulative KOI table and return it as a numeric DataFrame.

    split_disposition: bool
        - splits original df into 3 based on koi_disposition. returns all 4 dfs
          (3 split dfs + original)
    """
    col_drop = ['kepid', 'kepoi_name', 'kepler_name', #'koi_disposition',
                'koi_pdisposition', 'koi_score', 'koi_fpflag_nt', 'koi_fpflag_ss',
                'koi_fpflag_co', 'koi_fpflag_ec', 'koi_period_err1',
                'koi_period_err2', 'koi_time0bk_err1',
                'koi_time0bk_err2', 'koi_impact_err1', 'koi_impact_err2',
                'koi_duration_err1', 'koi_duration_err2', 'koi_tce_delivname',
                'koi_depth_err1', 'koi_depth_err2', 'koi_prad_err1',
                'koi_prad_err2', 'koi_teq_err1', 'koi_teq_err2',
                'koi_insol_err1', 'koi_insol_err2', 'koi_steff_err1', 'koi_steff_err2',
                'koi_slogg', 'koi_slogg_err1', 'koi_slogg_err2',
                'koi_srad_err1', 'koi_srad_err2', 'ra_str', 'dec_str', 'koi_kepmag_err']
    url = 'https://exoplanetarchive.ipac.caltech.edu/cgi-bin/nstedAPI/nph-nstedAPI?table=cumulative'
    # The response is CSV text; the first row holds the column names
    res = requests.get(url).text.split('\n')
    df = pd.DataFrame([elem.split(',') for elem in res])
    df.columns = df.iloc[0]
    df = df[1:].drop(col_drop, axis=1)
    # Encode the disposition labels as numbers
    df.koi_disposition = df.koi_disposition.apply(_map_disp)
    # Cast every value to float and drop rows with missing or unparsable entries
    df = df.apply(lambda col: col.apply(_to_float)).dropna()
    if not split_disposition:
        return df
    fp_df = df.query('koi_disposition == 0')
    can_df = df.query('koi_disposition == 1')
    con_df = df.query('koi_disposition == 2')
    return fp_df, can_df, con_df, df
# all_df = _get_data()
fp_df, can_df, con_df, all_df = _get_data(split_disposition=True)
display(fp_df)
display(can_df)
display(con_df)
display(all_df)
|  | koi_disposition | koi_period | koi_time0bk | koi_impact | koi_duration | koi_depth | koi_prad | koi_teq | koi_insol | koi_model_snr | koi_tce_plnt_num | koi_steff | koi_srad | koi_kepmag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 0.0 | 1.736952 | 170.307565 | 1.276 | 2.40641 | 8079.2 | 33.46 | 1395.0 | 891.96 | 505.6 | 1.0 | 5805.0 | 0.791 | 15.597 |
| 9 | 0.0 | 7.361790 | 132.250530 | 1.169 | 5.02200 | 233.7 | 39.21 | 1342.0 | 767.22 | 47.7 | 1.0 | 6227.0 | 1.958 | 12.660 |
| 15 | 0.0 | 11.521446 | 170.839688 | 2.483 | 3.63990 | 17984.3 | 150.51 | 753.0 | 75.88 | 622.1 | 1.0 | 5795.0 | 0.848 | 15.472 |
| 16 | 0.0 | 19.403938 | 172.484253 | 0.804 | 12.21550 | 8918.7 | 7.18 | 523.0 | 17.69 | 214.7 | 1.0 | 5043.0 | 0.680 | 15.487 |
| 17 | 0.0 | 19.221389 | 184.552164 | 1.065 | 4.79843 | 74284.0 | 49.29 | 698.0 | 55.97 | 2317.0 | 1.0 | 6117.0 | 0.947 | 15.341 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9558 | 0.0 | 373.893980 | 261.496800 | 0.963 | 27.66000 | 730.0 | 2.51 | 206.0 | 0.42 | 18.5 | 3.0 | 5263.0 | 0.699 | 14.911 |
| 9559 | 0.0 | 8.589871 | 132.016100 | 0.765 | 4.80600 | 87.7 | 1.11 | 929.0 | 176.40 | 8.4 | 1.0 | 5638.0 | 1.088 | 14.478 |
| 9560 | 0.0 | 0.527699 | 131.705093 | 1.252 | 3.22210 | 1579.2 | 29.35 | 2088.0 | 4500.53 | 453.3 | 1.0 | 5638.0 | 0.903 | 14.082 |
| 9562 | 0.0 | 0.681402 | 132.181750 | 0.147 | 0.86500 | 103.6 | 1.07 | 2218.0 | 5713.41 | 12.3 | 1.0 | 6173.0 | 1.041 | 15.385 |
| 9564 | 0.0 | 4.856035 | 135.993300 | 0.134 | 3.07800 | 76.7 | 1.05 | 1266.0 | 607.42 | 8.2 | 1.0 | 6469.0 | 1.193 | 14.826 |
4381 rows × 14 columns
|  | koi_disposition | koi_period | koi_time0bk | koi_impact | koi_duration | koi_depth | koi_prad | koi_teq | koi_insol | koi_model_snr | koi_tce_plnt_num | koi_steff | koi_srad | koi_kepmag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1.0 | 19.899140 | 175.850252 | 0.969 | 1.78220 | 10829.0 | 14.60 | 638.0 | 39.30 | 76.3 | 1.0 | 5853.0 | 0.868 | 15.436 |
| 38 | 1.0 | 4.959319 | 172.258529 | 0.831 | 2.22739 | 9802.0 | 12.21 | 1103.0 | 349.40 | 696.5 | 1.0 | 5712.0 | 1.082 | 15.263 |
| 59 | 1.0 | 40.419504 | 173.564690 | 0.911 | 3.36200 | 6256.0 | 7.51 | 467.0 | 11.29 | 36.9 | 1.0 | 5446.0 | 0.781 | 15.487 |
| 63 | 1.0 | 7.240661 | 137.755450 | 1.198 | 0.55800 | 556.4 | 19.45 | 734.0 | 68.63 | 13.7 | 2.0 | 5005.0 | 0.765 | 15.334 |
| 64 | 1.0 | 3.435916 | 132.662400 | 0.624 | 3.13300 | 23.2 | 0.55 | 1272.0 | 617.61 | 8.7 | 3.0 | 5779.0 | 1.087 | 12.791 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9539 | 1.0 | 7.268182 | 135.934800 | 0.780 | 4.98500 | 46.7 | 1.66 | 1444.0 | 1027.95 | 9.7 | 1.0 | 6297.0 | 2.219 | 13.729 |
| 9543 | 1.0 | 376.379890 | 486.602200 | 0.305 | 13.99000 | 1140.0 | 3.26 | 265.0 | 1.16 | 13.3 | 1.0 | 6231.0 | 0.955 | 15.632 |
| 9553 | 1.0 | 367.947848 | 416.209980 | 0.902 | 4.24900 | 1301.0 | 3.72 | 228.0 | 0.64 | 10.7 | 1.0 | 5570.0 | 0.855 | 15.719 |
| 9561 | 1.0 | 1.739849 | 133.001270 | 0.043 | 3.11400 | 48.5 | 0.72 | 1608.0 | 1585.81 | 10.6 | 1.0 | 6119.0 | 1.031 | 14.757 |
| 9563 | 1.0 | 333.486169 | 153.615010 | 0.214 | 3.19900 | 639.1 | 19.30 | 557.0 | 22.68 | 14.0 | 1.0 | 4989.0 | 7.824 | 10.998 |
1905 rows × 14 columns
|  | koi_disposition | koi_period | koi_time0bk | koi_impact | koi_duration | koi_depth | koi_prad | koi_teq | koi_insol | koi_model_snr | koi_tce_plnt_num | koi_steff | koi_srad | koi_kepmag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.0 | 9.488036 | 170.53875 | 0.146 | 2.9575 | 615.8 | 2.26 | 793.0 | 93.59 | 35.8 | 1.0 | 5455.0 | 0.927 | 15.347 |
| 2 | 2.0 | 54.418383 | 162.51384 | 0.586 | 4.5070 | 874.8 | 2.83 | 443.0 | 9.11 | 25.8 | 2.0 | 5455.0 | 0.927 | 15.347 |
| 5 | 2.0 | 2.525592 | 171.59555 | 0.701 | 1.6545 | 603.3 | 2.75 | 1406.0 | 926.16 | 40.9 | 1.0 | 6031.0 | 1.046 | 15.509 |
| 6 | 2.0 | 11.094321 | 171.20116 | 0.538 | 4.5945 | 1517.5 | 3.90 | 835.0 | 114.81 | 66.5 | 1.0 | 6046.0 | 0.972 | 15.714 |
| 7 | 2.0 | 4.134435 | 172.97937 | 0.762 | 3.1402 | 686.0 | 2.77 | 1160.0 | 427.65 | 40.2 | 2.0 | 6046.0 | 0.972 | 15.714 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8818 | 2.0 | 4.485592 | 135.43172 | 0.442 | 0.8161 | 2265.0 | 0.95 | 365.0 | 4.20 | 15.0 | 1.0 | 3236.0 | 0.193 | 15.737 |
| 8957 | 2.0 | 8.152759 | 134.19046 | 0.461 | 1.7460 | 16536.0 | 2.44 | 305.0 | 2.03 | 9.2 | 3.0 | 3327.0 | 0.189 | 17.475 |
| 9015 | 2.0 | 384.847556 | 314.97000 | 0.059 | 9.9690 | 189.9 | 1.09 | 220.0 | 0.56 | 12.3 | 1.0 | 5579.0 | 0.798 | 13.426 |
| 9084 | 2.0 | 3.875943 | 134.84758 | 0.025 | 2.3140 | 58.6 | 0.68 | 1081.0 | 323.21 | 14.2 | 1.0 | 5713.0 | 0.893 | 12.750 |
| 9182 | 2.0 | 24.278380 | 154.51325 | 0.717 | 4.5640 | 714.4 | 2.03 | 432.0 | 8.27 | 13.6 | 6.0 | 4450.0 | 0.707 | 14.164 |
2659 rows × 14 columns
|  | koi_disposition | koi_period | koi_time0bk | koi_impact | koi_duration | koi_depth | koi_prad | koi_teq | koi_insol | koi_model_snr | koi_tce_plnt_num | koi_steff | koi_srad | koi_kepmag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.0 | 9.488036 | 170.538750 | 0.146 | 2.95750 | 615.8 | 2.26 | 793.0 | 93.59 | 35.8 | 1.0 | 5455.0 | 0.927 | 15.347 |
| 2 | 2.0 | 54.418383 | 162.513840 | 0.586 | 4.50700 | 874.8 | 2.83 | 443.0 | 9.11 | 25.8 | 2.0 | 5455.0 | 0.927 | 15.347 |
| 3 | 1.0 | 19.899140 | 175.850252 | 0.969 | 1.78220 | 10829.0 | 14.60 | 638.0 | 39.30 | 76.3 | 1.0 | 5853.0 | 0.868 | 15.436 |
| 4 | 0.0 | 1.736952 | 170.307565 | 1.276 | 2.40641 | 8079.2 | 33.46 | 1395.0 | 891.96 | 505.6 | 1.0 | 5805.0 | 0.791 | 15.597 |
| 5 | 2.0 | 2.525592 | 171.595550 | 0.701 | 1.65450 | 603.3 | 2.75 | 1406.0 | 926.16 | 40.9 | 1.0 | 6031.0 | 1.046 | 15.509 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9560 | 0.0 | 0.527699 | 131.705093 | 1.252 | 3.22210 | 1579.2 | 29.35 | 2088.0 | 4500.53 | 453.3 | 1.0 | 5638.0 | 0.903 | 14.082 |
| 9561 | 1.0 | 1.739849 | 133.001270 | 0.043 | 3.11400 | 48.5 | 0.72 | 1608.0 | 1585.81 | 10.6 | 1.0 | 6119.0 | 1.031 | 14.757 |
| 9562 | 0.0 | 0.681402 | 132.181750 | 0.147 | 0.86500 | 103.6 | 1.07 | 2218.0 | 5713.41 | 12.3 | 1.0 | 6173.0 | 1.041 | 15.385 |
| 9563 | 1.0 | 333.486169 | 153.615010 | 0.214 | 3.19900 | 639.1 | 19.30 | 557.0 | 22.68 | 14.0 | 1.0 | 4989.0 | 7.824 | 10.998 |
| 9564 | 0.0 | 4.856035 | 135.993300 | 0.134 | 3.07800 | 76.7 | 1.05 | 1266.0 | 607.42 | 8.2 | 1.0 | 6469.0 | 1.193 | 14.826 |
8945 rows × 14 columns
sns.pairplot(all_df, hue="koi_disposition", diag_kind='kde')
pass
From the pairplots, we don't see any visible clustering in the scatter plots, but there is a clear relationship between koi_insol (Insolation Flux) and koi_teq (Equilibrium Temperature). This is expected because the equilibrium temperature depends directly on the stellar flux the planet receives: for a fixed albedo, it scales as the fourth root of the insolation flux.
Feature and Target Variable Assignment
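The relationship between these two columns can be illustrated with the textbook radiative-balance formula for equilibrium temperature. The sketch below is not the archive's documented procedure: the albedo of 0.3, the assumption of uniform heat redistribution (the factor of 4), the constants, and the function name are all assumptions made for illustration.

```python
# Assumed constants for this sketch
SOLAR_CONSTANT = 1361.0            # W/m^2, flux received at Earth
STEFAN_BOLTZMANN = 5.670374419e-8  # W/m^2/K^4


def equilibrium_temperature(insolation_earth_units, albedo=0.3):
    """Equilibrium temperature in kelvin for a flux given in Earth-flux units."""
    # Absorbed flux averaged over the whole sphere (hence the division by 4)
    absorbed = SOLAR_CONSTANT * insolation_earth_units * (1.0 - albedo) / 4.0
    # Balance absorbed flux against blackbody emission: sigma * T^4
    return (absorbed / STEFAN_BOLTZMANN) ** 0.25


# The first confirmed row in the tables above lists koi_insol = 93.59 and koi_teq = 793.0
print(equilibrium_temperature(93.59))
```

Under these assumptions the sketch gives roughly 790 K for that row, close to the tabulated 793 K, which is consistent with the two columns carrying largely the same information.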
# Features
_all_X = all_df.iloc[:, 1:]
_fp_X = fp_df.iloc[:, 1:]
_can_X = can_df.iloc[:, 1:]
_con_X = con_df.iloc[:, 1:]
# Target
_all_y = all_df.koi_disposition
_fp_y = fp_df.koi_disposition
_can_y = can_df.koi_disposition
_con_y = con_df.koi_disposition
Principal Component Analysis
We perform PCA on the entire dataset to see whether there are observable patterns in the feature space. In addition, we also perform PCA on the different dispositions to identify features that are important to each.
def pca(X, n=2):
    # Standardize each feature to zero mean and unit variance before PCA
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    model = PCA(n)
    X_new = model.fit_transform(X)
    # Return the projected data, the loadings (features x PCs), and the variance ratios
    return X_new, model.components_.T, model.explained_variance_ratio_
_all_new, w_all, variance_explained_all = pca(_all_X, n=None)
_fp_new, w_fp, variance_explained_fp = pca(_fp_X, n=None)
_can_new, w_can, variance_explained_can = pca(_can_X, n=None)
_con_new, w_con, variance_explained_con = pca(_con_X, n=None)
df = pd.DataFrame(data = _all_new[:, :2],
columns = ['PC1', 'PC2'])
df.reset_index(drop=True, inplace=True)
_all_y.reset_index(drop=True, inplace=True)
result_df = pd.concat([df, _all_y], axis=1)
# Visualize Principal Components with a scatter plot
fig = plt.figure(figsize = (12,10))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('First Principal Component ', fontsize = 15)
ax.set_ylabel('Second Principal Component ', fontsize = 15)
ax.set_title('Principal Component Analysis (2PCs) for ExoPlanet Dataset', fontsize = 20)
targets = [0, 1, 2]
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    # Build the mask from result_df so that its index lines up with the PC columns
    indicesToKeep = result_df.koi_disposition == target
    ax.scatter(result_df.loc[indicesToKeep, 'PC1'],
               result_df.loc[indicesToKeep, 'PC2'],
               c = color,
               s = 10)
ax.legend(['False Positive', 'Candidate', 'Confirmed'])
features = list(_all_X.columns)
# Overlay the feature loadings as arrows (scaled up for visibility)
for feature, vec in zip(features, w_all):
    plt.arrow(0, 0, 20*vec[0], 20*vec[1], width=0.1, ec='none', fc='r')
    plt.text(25*vec[0], 25*vec[1], feature, ha='center', color='r')
ax.grid()
plt.rcParams["figure.figsize"] = (15, 5)
fig1, ax1 = plt.subplots(1, 3)
ax1[0].scatter(_fp_new[:, 0], _fp_new[:, 1])
ax1[1].scatter(_can_new[:, 0], _can_new[:, 1])
ax1[2].scatter(_con_new[:, 0], _con_new[:, 1])
ax1[0].set_xlabel("PC1_fp")
ax1[1].set_xlabel("PC1_can")
ax1[2].set_xlabel("PC1_con")
ax1[0].set_ylabel("PC2_fp")
ax1[1].set_ylabel("PC2_can")
ax1[2].set_ylabel("PC2_con")
ax1[0].set_title("False Positive")
ax1[1].set_title("Candidate")
ax1[2].set_title("Confirmed")
for feature, vec in zip(features, w_fp):
    ax1[0].arrow(0, 0, 20*vec[0], 20*vec[1], width=0.1, ec='none', fc='r')
    ax1[0].text(25*vec[0], 25*vec[1], feature, ha='center', color='r')
for feature, vec in zip(features, w_can):
    ax1[1].arrow(0, 0, 20*vec[0], 20*vec[1], width=0.1, ec='none', fc='r')
    ax1[1].text(25*vec[0], 25*vec[1], feature, ha='center', color='r')
for feature, vec in zip(features, w_con):
    ax1[2].arrow(0, 0, 20*vec[0], 20*vec[1], width=0.1, ec='none', fc='r')
    ax1[2].text(25*vec[0], 25*vec[1], feature, ha='center', color='r')
The interpretation of the first two PCs for each subclass is as follows:
False Positive
Candidate
Confirmed
From our interpretations of the principal components, we see that the interpreted features for the confirmed exoplanets are quite different from those of the other two dispositions. This insight may be helpful in evaluating the status of candidate exoplanets. Round Earth FTW!
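One simple way to support this kind of interpretation is to rank the features by the absolute size of their weights on a given principal component. The sketch below does this with NumPy's SVD on standardized data; top_loadings, the synthetic data, and the feature names are hypothetical and only meant to show the idea.

```python
import numpy as np


def top_loadings(X, feature_names, component=0, k=3):
    """Return the k features with the largest absolute weight on one principal component."""
    X = np.asarray(X, dtype=float)
    # Standardize so that every feature contributes on the same scale
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # The rows of Vt are the principal axes of the standardized data
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    weights = Vt[component]
    order = np.argsort(-np.abs(weights))[:k]
    return [(feature_names[i], float(weights[i])) for i in order]


# Synthetic data: feat_a and feat_b are strongly correlated, feat_c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X_demo = np.column_stack([a, a + 0.05 * rng.normal(size=200), rng.normal(size=200)])
names = ["feat_a", "feat_b", "feat_c"]
print(top_loadings(X_demo, names, component=0, k=2))
```

The sign of a principal component is arbitrary, so only the relative signs and magnitudes of the weights within a component carry meaning.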
plt.rcParams["figure.figsize"] = (15,15)
fig1, ax1 = plt.subplots(4, 1)
ax1[0].plot(range(1, len(variance_explained_fp)+1),
variance_explained_fp.cumsum(), 'o-')
ax1[0].axhline(0.9, ls='--', color='g')
ax1[0].axvline(9, ls='--', color='g')
ax1[1].axhline(0.9, ls='--', color='g')
ax1[1].axvline(9, ls='--', color='g')
ax1[2].axhline(0.9, ls='--', color='g')
ax1[2].axvline(9, ls='--', color='g')
ax1[3].axhline(0.9, ls='--', color='g')
ax1[3].axvline(9, ls='--', color='g')
ax1[1].plot(range(1, len(variance_explained_can)+1),
variance_explained_can.cumsum(), 'o-')
ax1[2].plot(range(1, len(variance_explained_con)+1),
variance_explained_con.cumsum(), 'o-')
ax1[3].plot(range(1, len(variance_explained_all)+1),
variance_explained_all.cumsum(), 'o-')
ax1[0].set_ylim(0,1)
ax1[1].set_ylim(0,1)
ax1[2].set_ylim(0,1)
ax1[3].set_ylim(0,1)
plt.xlabel('number of PCs')
ax1[0].set_ylabel('cumulative variance explained')
ax1[1].set_ylabel('cumulative variance explained')
ax1[2].set_ylabel('cumulative variance explained')
ax1[3].set_ylabel('cumulative variance explained')
ax1[0].set_title("False Positive")
ax1[1].set_title("Candidate")
ax1[2].set_title("Confirmed")
Comparing the cumulative variance explained across all dispositions, we get about 90% of the cumulative variance explained with 9 PCs. The low reduction in dimensionality suggests that each subset of the data has a lot of variance spread across multiple features.
Singular Value Decomposition
To supplement the insights from the other analyses, the team conducted SVD on the entire dataset and identified two prominent features that drive the first two singular vectors. The first singular vector is positively influenced by the insolation flux. The second singular vector is positively influenced by the transit depth, which is also known as the stellar flux lost. Their perpendicularity suggests that these two features are independent of each other.
from sklearn.decomposition import TruncatedSVD
def Tsvd(X):
    svd = TruncatedSVD(n_components=13)
    svd.fit(X)
    # Keep the smallest number of components that reaches 80% cumulative variance explained
    cve = 0.80
    n_components = 1 + np.argmax(
        np.cumsum(svd.explained_variance_ratio_) >= cve)
    X_new = svd.transform(X)[:, :n_components]
    return X_new, svd.components_.T, svd.explained_variance_ratio_
_all_new_svd, w_all_svd, variance_explained_all_svd = Tsvd(all_df)
_fp_new_svd, w_fp_svd, variance_explained_fp_svd = Tsvd(fp_df)
_can_new_svd, w_can_svd, variance_explained_can_svd = Tsvd(can_df)
_con_new_svd, w_con_svd, variance_explained_con_svd = Tsvd(con_df)
fig, ax = plt.subplots(1, 2, subplot_kw=dict(aspect='equal'),
gridspec_kw=dict(wspace=0.4), dpi=150)
ax[0].scatter(_all_new_svd[:, 0], _all_new_svd[:, 1])
ax[0].set_xlabel('SV1')
ax[0].set_ylabel('SV2')
# Tsvd was fit on all_df, so label each row of weights with all_df's own columns
for feature, vec in zip(all_df.columns, w_all_svd):
    ax[1].arrow(0, 0, vec[0], vec[1], width=0.01, ec='none', fc='r')
    ax[1].text(vec[0], vec[1], feature, ha='center', color='r', fontsize=5)
ax[1].set_xlim(-1, 1)
ax[1].set_ylim(-1, 1)
ax[1].set_xlabel('SV1')
ax[1].set_ylabel('SV2')
Non-negative Matrix Factorization
X = np.array(_all_X)
from sklearn.decomposition import NMF
nmf = NMF()
U = nmf.fit_transform(X)
V = nmf.components_.T
fig, ax = plt.subplots()
ax.spy(V)
ax.set_xticks(range(len(features)))
ax.set_yticks(range(len(features)))
ax.set_yticklabels(features)
In addition, we want to evaluate whether clustering via latent factors can be used to identify confirmed exoplanets.
from sklearn.decomposition import PCA
pca = PCA(2)
plt.scatter(*pca.fit_transform(X).T, c=U.argmax(axis=1), cmap='Set1')
plt.xlabel('PC1')
plt.ylabel('PC2')
print(set(U.argmax(axis=1)))
{0, 1, 6, 7, 8, 9, 10, 11, 12}
# re-map koi_disposition to be 1 only if the exoplanet is confirmed
all_df['LF'] = U.argmax(axis=1)
all_df.koi_disposition = all_df.koi_disposition.apply(lambda elem: 0 if (elem == 1 or elem == 0) else 1)
total = all_df.groupby(['LF']).koi_disposition.count()
confirmed = all_df.groupby(['LF']).koi_disposition.sum()
_df = pd.DataFrame({'confirmed': confirmed, 'total': total})
_df.plot.barh(width=0.85, figsize=(15, 5))
plt.figure(figsize=(15, 5))
# Multiply by 100 so the bar heights match the percentage label on the y-axis
plt.bar(list(confirmed.index), 100 * confirmed / total)
plt.xlabel('Latent Factors')
plt.ylabel('% of Confirmed Exoplanets')
The low percentage of confirmed exoplanets in each cluster suggests that clusters created through latent factors are not an effective way to evaluate whether a candidate is a confirmed exoplanet.
Applying PCA to each disposition shows that confirmed exoplanets do have fundamentally different important features from candidate and false positive exoplanets. Our SVD results identify generally the same important features for exoplanets. Lastly, we saw that clustering by latent factors is not an effective way to identify confirmed exoplanets.
Our interpretations of the PCs probably have equivalent physical quantities that can be measured or calculated. Adding those quantities to our data might help describe the different observations (confirmed, candidate, or false positive) that we have.
It was evident that the latent factors we got from our original features do not sufficiently describe our observations; that is, we obtained low accuracy. Perhaps adding the newly interpreted features would yield a better clustering result.
This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program
Cote, Jackson (2022, April 26). Exploring the dangers of microplastics. Northeastern University College of Engineering. Retrieved Dec 5, 2022 from https://coe.northeastern.edu/news/exploring-the-dangers-of-microplastics/
European Geosciences Union (2018, August 30). Deadline for climate action: Act strongly before 2035 to keep warming below 2°C. ScienceDaily. Retrieved December 5, 2022 from www.sciencedaily.com/releases/2018/08/180830084818.htm